Day 2 Session 1:
Dynamic spectral analysis
using Functional Principal Component Analysis (FPCA)
Introduction
Principal Component Analysis (PCA)
Let’s apply Principal Component Analysis (PCA) onto the static data.
Preliminaries
Importing data set
Checking data and data wrangling
As usual, let’s check what variables are included in the data set
using colnames().
## [1] "...1" "file" "speaker" "language"
## [5] "duration" "segment" "word" "f1"
## [9] "f2" "f3" "previous_sound" "next_sound"
## [13] "percent" "IsApproximant" "IsAcoustic" "gender"
## [17] "omit" "position"
Also remove irrelevant columns and rename the variables in the
vowel column as has been done previously.
# Remove columns
df_mid <- df_mid |>
dplyr::select(-c(IsApproximant, IsAcoustic, omit))
# Rename vowel variables
df_mid <- df_mid |>
dplyr::mutate(
vowel = case_when(
next_sound == "AE1" ~ "/æ/",
next_sound == "IY1" ~ "/i/",
next_sound == "UW1" ~ "/u/",
)
)
# z-normalise formant values for cross-speaker comparison
df_mid <- df_mid |>
dplyr::group_by(speaker) |>
dplyr::mutate(
f1z = scale(f1),
f2z = scale(f2),
f3z = scale(f3)
) |>
dplyr::ungroup()Data visualisation
Let’s try data visualisation. First, we’ll compare between-group
difference using a combination of scatter, box and violin plots. In
order to plot everything altogether, we’ll convert the data set into
long data using the tidyr::pivot_longer() function.
# convert the data set into a long data
df_mid_long <- df_mid |>
dplyr::select(-c(f1, f2, f3)) |>
tidyr::pivot_longer(14:16, names_to = "formant", values_to = "values")
# plot
df_mid_long |>
ggplot(aes(x = language, y = values)) +
geom_jitter(aes(colour = language), alpha = 0.5) +
geom_violin(alpha = 0.4) +
geom_boxplot(width = 0.3, alpha = 0.4) +
facet_grid(vowel ~ formant) +
scale_colour_manual(values = cbPalette) +
theme(
strip.text.y = element_text(angle = 360)
)Let’s also take a slightly different approach. The objective of PCA is to draw straight lines along the dimension that shows the greatest variance. To inspect the variance in the data, we’ll plot the relationships between the three spectral dimensions.
# F1 and F2
df_mid |>
ggplot(aes(x = f1z, y = f2z)) +
geom_point(aes(colour = language), alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
facet_wrap(~ segment)# F2 and F3
df_mid |>
ggplot(aes(x = f2z, y = f3z)) +
geom_point(aes(colour = language), alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
facet_wrap(~ segment)# F1 and F3
df_mid |>
ggplot(aes(x = f1z, y = f3z)) +
geom_point(aes(colour = language), alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
facet_wrap(~ segment)As expected, English /l/ and /ɹ/ can be reliably distinguished along the F3 dimension, such that English /ɹ/ shows lower F3 than English /l/.
The key takeaway here, however, is that we always need multiple plots in order to obtain a holistic understanding of the data structure, which is somewhat cumbersome. Here, PCA comes in handy, as it is a dimension reduction technique and boils down the variance in the data into a handful of principal components.
Applying PCA
Let’s apply PCA on the formant data. We’ll classify the tokens based
on the z-scored F1, F2 and F3. We use the prcomp() function
to perform PCA.
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.1297 0.9955 0.8218
## Proportion of Variance 0.4337 0.3368 0.2295
## Cumulative Proportion 0.4337 0.7705 1.0000
The summary shows three rows:
Standard deviation: This fundamentally shows the variance shown in the data. The greater the value, the more variance is captured by each PC.
Proportion of Variance: This expresses the amount of variance explained by each PC in a proportional manner. This can be calculated with the following formulae:
\[ pca\_var\_exp = \frac{(\text{sdev})^2}{\sum pca\_var} \]
- Cumulative Proportion: This simply shows the sum of the proportion of variance explained by each PC. For instance, PC1 explains 43.37% of variance whereas PC2 33.68%. The cumulative proportion of PC1 and PC2 is thus 77.05% (i.e., 43.37% + 33.68%).
PCA summary
Let’s also look into the inside of pca_mid.
## Standard deviations (1, .., p=3):
## [1] 1.1297301 0.9955437 0.8217881
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## f1z -0.1095097 0.97205301 0.2076550
## f2z -0.6942959 -0.22430409 0.6838428
## f3z -0.7113093 0.06928656 -0.6994559
The top line shows standard deviation, which we just saw earlier, so let’s skip this for now.
At the bottom, we have loadings matrix. This shows eigenvectors: i.e., the amount and (2) the direction of the contribution that the original data dimensions make to each PC. Let’s disentangle this one by one:
PC1: Contributions of
f1z,f2zandf3zare all in the same direction (i.e., negative). But the amount of contribution is fairly small forf1zcompared tof2zandf3z.PC2: In contrast to PC1,
f1zcontributes to PC2 to a greater extent thanf2zandf3z.PC3: The amount of contribution itself is quite similar to PC1, in which
f2zandf3zmakes a substantial contribution. But the directionality is opposite betweenf2zandf3z.
So, all in all, PCA might suggest:
PC1 captures a covariation between
f2zandf3z, which may correspond to the spectral difference between /l/ and /ɹ/.PC2 mainly captures the variation of
f1z.PC3 captures an inverse relationship between
f2zandf3z, which might correspond to differences in coarticulatory effects.
The interpretation may be better facilitated by data visualisation. Let’s try two main approaches to data visualiastion of PCA.
Data visualisation 1: Scree plot
The amount of variance is useful when determining which PC to retain for analysis. As a rule of thumb, Bayeen (2008) recommends that we retain PCs that explains greater than 5% of variance in the data.
A useful visualisation of the proportion of variance is scree
plot. There is a default function screeplot() that
lets you create a scree plot quite easily:
You can also create a scree plot using ggplot() by
manually calculating the proportion of variance based on the standard
deviation (see above for the formula). This is useful when you need more
flexibility – e.g., to show the 5% thereshold suggested by Bayeen
(2008).
## obtaining proportion of variance
pca_var_exp <- pca_mid$sdev^2 / sum(pca_var)
# making var_explained as a tibble
pca_var_exp <- tibble::as_tibble(pca_var_exp)
# adding column name
pca_var_exp <- pca_var_exp |>
tibble::as_tibble() |>
dplyr::mutate(
PC = row_number()
)
# create a plot
sclee_plot <- pca_var_exp |>
ggplot(aes(x = PC, y = value)) +
geom_line() +
geom_text(aes(label = round(value, digits = 3)*100), nudge_x = 0.2) +
geom_label(aes(label = PC), label.padding = unit(0.40, "lines")) +
geom_hline(yintercept = 0.05, linetype = 'dotted') +
scale_x_continuous(breaks = c(1, 2, 3)) +
labs(x = "Principal Component", y = "Variance Explained", title = "Proportion of Variance explained by each PC")
# showing plot
sclee_plot
Since there are only three PCs identified from the data, it doesn’t make
much sense to visualise them. But this will be useful when
prcomp() identifies more than a few PCs.
creating biplot
The interpretation of loading matrix was more or less straightforward in this case given that only three parameters were involved. It is easy to see, however, that this can get quite complicated when the original data have a greater number of dimensions.
Biplots help us better understand the loadings and
PC scores for each observation. For example, the default function
biplot() plots all observations along the PC1 and PC2
dimensions with eigenvectors shown in red arrows (i.e.,
the direction and the amount of weighting: also called
loadings).
Similarly to the scree plot, we could also create biplots manually using
ggplot().
# Extract PCA scores (principal component scores for each observation)
pca_scores <- as.data.frame(pca_mid$x)
# Extract PCA loadings (the contribution of each original variable to the PCs)
pca_loadings <- as.data.frame(pca_mid$rotation)
# Combine scores and loadings into one data frame for plotting
scores_df <- cbind(pca_scores, df_mid[, c("segment", "language", "vowel")])
# Reshape the loadings for visualization
loadings_df <- pca_loadings |>
mutate(variable = rownames(pca_loadings))
# Plotting the PCA biplot with facets by language
ggplot() +
geom_point(data = scores_df, aes(x = PC1, y = PC2, color = vowel, shape = vowel), size = 2, alpha = 0.5) + # Plot the PCA scores (observations)
geom_segment(data = loadings_df, aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), arrow = arrow(type = "closed", length = unit(0.1, "inches")), color = "red") + # Plot the loadings (variables)
geom_text(data = loadings_df, aes(x = PC1 * 5, y = PC2 * 5, label = variable), size = 4, color = "red") + # Add labels for the loadings
stat_ellipse(data = scores_df, aes(x = PC1, y = PC2, color = vowel, fill = vowel), geom = "polygon", alpha = 0.2, level = 0.95) +
facet_grid(segment ~ language) +
labs(title = "PCA biplot by language") +
scale_color_manual(values = cbPalette) +
scale_fill_manual(values = cbPalette) +
theme(legend.position = "bottom",
strip.text.y = element_text(angle = 360))Functional Principal Component Analysis (FPCA)
Preliminaries
Installing/loading packages
Importing data set
Let’s import the data set. We are using the data set openly available on the Open Science Framework (OSF) repository.
Data wrangling
Check data
We always start with inspecting the data set using
colnames().
## [1] "...1" "file" "speaker" "language"
## [5] "f0" "duration" "segment" "previous_sound"
## [9] "next_sound" "word" "f1" "f2"
## [13] "f3" "time" "IsApproximant" "IsAcoustic"
## [17] "note" "gender" "omit" "Barkf1"
## [21] "Barkf2" "Barkf3" "f2f1" "f3f2"
## [25] "Barkf2f1" "Barkf3f2" "position" "context"
## [29] "liquid" "input_file"
Omitting irrelavent columns
We’ll omit the columns we don’t need.
# Let's check the number of "approximant" tokens
df_dyn |>
dplyr::group_by(IsApproximant) |>
dplyr::summarise() |>
dplyr::ungroup()## # A tibble: 1 × 1
## IsApproximant
## <chr>
## 1 yes
# Let's check the number of tokens of good recording quality
df_dyn |>
dplyr::group_by(IsAcoustic) |>
dplyr::summarise() |>
dplyr::ungroup()## # A tibble: 1 × 1
## IsAcoustic
## <chr>
## 1 yes
## # A tibble: 1 × 1
## omit
## <dbl>
## 1 0
# Remove columns that we no longer need
df_dyn <- df_dyn |>
dplyr::select(-c(IsApproximant, IsAcoustic, omit, Barkf1, Barkf2, Barkf3, Barkf2f1, Barkf3f2, f2f1, f3f2))Let’s check the column names again.
## [1] "...1" "file" "speaker" "language"
## [5] "f0" "duration" "segment" "previous_sound"
## [9] "next_sound" "word" "f1" "f2"
## [13] "f3" "time" "note" "gender"
## [17] "position" "context" "liquid" "input_file"
Let’s also convert the context column into IPA symbols
for a more intuitive representation:
Checking the number of participants, tokens…
Let’s also obtain some descriptive statistics here. Note that we need to divide the number of rows by 11 to obtain the accurate number of tokens, as one token now has 11 time points.
# number of participants
df_dyn |>
dplyr::group_by(language) |>
dplyr::summarise(n = n_distinct(speaker)) |>
dplyr::ungroup()## # A tibble: 2 × 2
## language n
## <chr> <int>
## 1 English 14
## 2 Japanese 31
# number of tokens per segment
df_dyn |>
dplyr::group_by(segment) |>
dplyr::summarise(n = n()/11) |> # divide by 11 time points
dplyr::ungroup()## # A tibble: 6 × 2
## segment n
## <chr> <dbl>
## 1 LAE1 511
## 2 LIY1 408
## 3 LUW1 299
## 4 RAE1 502
## 5 RIY1 481
## 6 RUW1 314
Data visualisation
Scaling formant frequencies
Do you remember how to visualise the dynamic data? The basic procedure is the same as in the static analysis; We first apply z-score normalisation to the formant frequencies to make sure that formant values are comparable across speakers.
df_dyn <- df_dyn |>
dplyr::group_by(speaker) |> # tell R to do the following iteration per speaker
dplyr::mutate(
f1z = as.numeric(scale(f1)), # scale f1 into z-score
f2z = as.numeric(scale(f2)), # scale f2 into z-score
f3z = as.numeric(scale(f3)) # scale f3 into z-score
) |>
dplyr::ungroup() # don't forget ungroupingDescriptive statistics
Let’s check the mean and SD for both raw and normalised formant values: just see F1 for now. Note that the mean z-scores do not seem to look zero, but this is because computers are not very good at dealing with very small numbers (e.g., decimals) and some fluctuations occur in computing the values.
# check mean and sd of raw/scaled F1 values for each speaker
df_dyn |>
dplyr::group_by(speaker) |>
dplyr::summarise(
f1_mean = mean(f1),
f1_sd = sd(f1),
f1z_mean = mean(f1z),
f1z_sd = sd(f1z)
) |>
dplyr::ungroup() ## # A tibble: 45 × 5
## speaker f1_mean f1_sd f1z_mean f1z_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2d57ke 468. 140. -9.05e-17 1
## 2 2drb3c 575. 243. 2.39e-16 1
## 3 2zy9tf 459. 228. -2.41e-16 1
## 4 3bcpyh 438. 142. -5.72e-18 1.00
## 5 3hsubn 537. 177. 6.11e-17 1
## 6 3pzrts 458. 195. -1.44e-18 1.00
## 7 4ps8zx 467. 172. 2.19e-16 1
## 8 54i2ks 453. 192. 1.43e-16 1.00
## 9 5jzj2h 412. 133. 2.49e-16 1.00
## 10 5upwe3 444. 189. -6.51e-17 1.00
## # ℹ 35 more rows
Visualisation
raw trajectories
Let’s visualise the raw data first:
# F2 - raw trajectories
df_dyn |>
ggplot(aes(x = time, y = f2z)) +
geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.4) +
geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
scale_colour_manual(values = cbPalette) +
facet_grid(liquid ~ context) +
labs(x = "time", y = "F2 (z-normalised)", title = "time-varying change in F2 frequency") +
theme(strip.text.y = element_text(angle = 0))smooths
Let’s also try plotting smoothed trajectories:
# F2 - smooths
df_dyn |>
ggplot(aes(x = time, y = f2z)) +
# geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
# geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
geom_smooth(aes(colour = language, group = language), linewidth = 1.2, se = TRUE) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
scale_colour_manual(values = cbPalette) +
facet_grid(liquid ~ context) +
labs(x = "time", y = "F2 (z-normalised)", title = "smoothed time-varying change in F2 frequency") +
theme(strip.text.y = element_text(angle = 0))## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Functional Principal Component Analysis (FPCA)
At this point, we tried fitting Generalised Additive Mixed-Effect Models (GAMMs) to investigate between-group differences in F2 dynamics. This was possible because we had some ideas of possible factors (e.g., vowel context, speaker groups). In a way, lots of decisions were made in a top-down manner.
We could also try another approach; a bottom-up approach. That is, does the data tell us anything about higher-order groupings? Do the vowel context/speaker group differences emerge from the data?
Here, let’s try Functional Principal Component Analysis (FPCA) as a bottom-up approach.
# IDs = token column; tVec = time column; yVec = variable column(s)
input_df <- fdapace::MakeFPCAInputs(IDs = df_dyn$file, tVec = df_dyn$time, yVec = df_dyn$f2z)
# Check if there's any issues with the data
fdapace::CheckData(input_df$Ly, input_df$Lt)
# No errors have been returned, so let's now run fPCA on the dynamic PC1 trajectory
df_dyn_fpca <- fdapace::FPCA(Ly = input_df$Ly, input_df$Lt, optns = list(plot = TRUE))checking fpca results
## [1] 73.431312 13.589246 4.373711 1.285172
## [1] 0.7859692 0.9314212 0.9782350 0.9919908
## [1] 0 10 20 30 40 50 60 70 80 90 100
## [,1] [,2] [,3] [,4]
## [1,] 1.3978292 5.260927 2.1324070 -2.52256410
## [2,] -1.5626684 5.595501 0.2863422 -1.84886236
## [3,] -0.6166366 4.291694 1.1572724 -1.35058648
## [4,] -5.5315817 2.679823 0.1415474 0.08740263
## [5,] 0.5428989 3.542994 3.8331163 -2.47430535
## [6,] -2.7433146 7.071773 -2.1509403 -1.88748022
## $mu
## [1] -0.70831425 -0.56725924 -0.32739985 -0.04873545 0.18428169 0.31887341
## [7] 0.37257545 0.37071962 0.31025622 0.17889910 -0.08389668
##
## $workGrid
## [1] 0 10 20 30 40 50 60 70 80 90 100
##
## $bwMu
## NULL
##
## $optns
## $optns$userBwMu
## [1] 5
##
## $optns$methodBwMu
## [1] "Default"
##
## $optns$userBwCov
## [1] 10
##
## $optns$methodBwCov
## [1] "Default"
##
## $optns$kFoldMuCov
## [1] 10
##
## $optns$methodSelectK
## [1] "FVE"
##
## $optns$FVEthreshold
## [1] 0.99
##
## $optns$FVEfittedCov
## NULL
##
## $optns$fitEigenValues
## [1] FALSE
##
## $optns$maxK
## [1] 9
##
## $optns$dataType
## [1] "Dense"
##
## $optns$error
## [1] TRUE
##
## $optns$shrink
## [1] FALSE
##
## $optns$nRegGrid
## [1] 11
##
## $optns$rotationCut
## [1] 0.25 0.75
##
## $optns$methodXi
## [1] "IN"
##
## $optns$kernel
## [1] "epan"
##
## $optns$lean
## [1] FALSE
##
## $optns$diagnosticsPlot
## NULL
##
## $optns$plot
## [1] TRUE
##
## $optns$numBins
## NULL
##
## $optns$useBinnedCov
## [1] TRUE
##
## $optns$usergrid
## [1] FALSE
##
## $optns$yname
## [1] "Ly"
##
## $optns$methodRho
## [1] "vanilla"
##
## $optns$verbose
## [1] FALSE
##
## $optns$userMu
## NULL
##
## $optns$userCov
## NULL
##
## $optns$methodMuCovEst
## [1] "cross-sectional"
##
## $optns$userRho
## NULL
##
## $optns$userSigma2
## NULL
##
## $optns$outPercent
## [1] 0 1
##
## $optns$useBinnedData
## [1] "AUTO"
##
## $optns$useBW1SE
## [1] FALSE
##
## $optns$imputeScores
## [1] TRUE
# function: get PC scores + return data frame with PCs for each token
get_pc_scores <- function(fpcaObj){
pcs <- data.frame(fpcaObj$xiEst)
token <- names(fpcaObj$inputData$Lt)
df <- cbind(token, pcs)
n_pcs <- length(fpcaObj$lambda) # get number of PCs
pc_names <- paste0("PC", 1:n_pcs) # create colnames for PCs
names(df) <- c("file", pc_names) # add colnames for token + PCs
return(df)
}
# get PC scores w/ token info
pc_df <- get_pc_scores(df_dyn_fpca)
# join PCs (dat) with selected cols from original data frame
## store meta info
meta <- df_dyn |>
select(speaker, gender, language, word, liquid, context, file)
## merge the list and meta data - unique(meta) because otherwise there would be lots of duplicates
dat <- left_join(pc_df, unique(meta), by = "file")